Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(rust): Use ObjectStore instead of AsyncRead in parquet get metadata #15069

Merged

Conversation

mickvangelderen
Copy link
Contributor

@mickvangelderen mickvangelderen commented Mar 14, 2024

Fixes #15032

For the record, I do like the idea of the AsyncRead and AsyncSeek and AsyncWrite abstraction. We just want to eliminate the possibility of the implementation causing network connection resets. Maybe I have to split the object store deduplication from this PR.

@ritchie46 is there any way we can determine whether this solves #14384?

@github-actions github-actions bot added internal An internal refactor or improvement rust Related to Rust Polars labels Mar 14, 2024
@mickvangelderen mickvangelderen force-pushed the deduplicate-polars-object-store branch from 7f2815e to e123b3b Compare March 14, 2024 15:20
@mickvangelderen mickvangelderen force-pushed the deduplicate-polars-object-store branch from e123b3b to 66fe8d7 Compare March 14, 2024 16:10
@mickvangelderen mickvangelderen force-pushed the deduplicate-polars-object-store branch from baa9d56 to 860ec33 Compare March 14, 2024 16:38
}

/// Asynchronously reads the files' metadata
pub async fn fetch_metadata(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html provides an implementation of this which also handles prefetch and loading page indices

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tip @tustvold. @mickvangelderen can you take a look at that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can see, that implementation is similar. It allows passing a footer size hint to hopefully save one round trip and has tests that also check the number of fetches.

@ritchie46 what do we want to do with that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ritchie46 what do we want to do with that?

In what sense? The footer size or the tests? I would favor us the be conservatively here. Rather a larger request but have one, then needing two calls. But that's my gut feeling.

Copy link
Contributor Author

@mickvangelderen mickvangelderen Mar 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For that to make sense we need to have a good idea about how large the footer metadata is in the common case. We are making a trade-off between latency and throughput. If we request a large amount of data so that the footer metadata is likely contained within it, we are possibly waiting to download data that we are going to throw away again (the part that we are downloading but do not use). If we request too few bytes we are affected by both throughput and latency. If we request exactly enough bytes we are affected by latency but not the throughput for bytes that we throw away.

What is most efficient depends on the latency, throughput and footer byte size.

Having to read files backwards is not great. It would be nice if file formats optimize for reading rather than writing. If we take one step back, the real solution is to use a file format that can be read efficiently over a network connection.

Anyway, I am happy to reimplement the optimistic fetch. What should the pre-fetch size be? Should it be configurable (env or parameter)?

@tustvold
Copy link

For the record, I do like the idea of the AsyncRead and AsyncSeek and AsyncWrite abstraction.

FYI there's work in flight to remove the final use of these traits in ObjectStore, I'd be interested in any thoughts on this - apache/arrow-rs#5500

@ritchie46
Copy link
Member

For the record, I do like the idea of the AsyncRead and AsyncSeek and AsyncWrite abstraction. We just want to eliminate the possibility of the implementation causing network connection resets. Maybe I have to split the object store deduplication from this PR.

I don't. It is really not clear how many calls that will lead to. As it would work for any generic async function.

@ritchie46
Copy link
Member

@ion-elgreco had a problem, but they are hard to reproduce locally. If we have this PR ready we could ask him to take it for a spin.

@mickvangelderen
Copy link
Contributor Author

mickvangelderen commented Mar 15, 2024

FYI there's work in flight to remove the final use of these traits in ObjectStore, I'd be interested in any thoughts on this - apache/arrow-rs#5500

I guess AsyncRead + AsyncSeek would not allow you to spawn two connections and read from different parts of the same file concurrently if you wanted to. So maybe those traits don't give the control that we need indeed.

Seems like a nice PR, improving features while removing more code than you're adding is often a good sign.

@ion-elgreco
Copy link
Contributor

@ritchie46 let me know when I can try it out :)

Copy link
Member

@ritchie46 ritchie46 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great Mick!

}

/// Asynchronously reads the files' metadata
pub async fn fetch_metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ritchie46 what do we want to do with that?

In what sense? The footer size or the tests? I would favor us the be conservatively here. Rather a larger request but have one, then needing two calls. But that's my gut feeling.

@ritchie46 ritchie46 merged commit 4933040 into pola-rs:main Mar 15, 2024
18 checks passed
@mickvangelderen mickvangelderen deleted the deduplicate-polars-object-store branch March 15, 2024 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal An internal refactor or improvement rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deduplicate Polars ObjectStore wrapper between IPC and Parquet
4 participants